- Wed 17 July 2019
- Projects
- Tony Hall
#Foursquare ABS Suburbs Recommender
This project utilizes publicly available data to recommend 10 suburbs in Australia that are similar to a given suburb. This could be useful for people deciding where to work, rent or buy property. For example, it could be used on property search websites: a person searches for property in a given suburb and is shown a list of similar suburbs to consider buying in. Three primary data sources are used to determine how similar other suburbs are to the given suburb:
- Foursquare Places API - current data on the venues (cafes, restaurants, parks, amenities etc) most often found in the suburb
- Australian Bureau of Statistics API – data on the cultural background (i.e. ancestry) of the people in the suburb
- RealestateVIEW.com.au – Age and income demographics of the suburb
For example, given the suburb ‘Brunswick West’ (my home suburb in Melbourne) the engine returns 10 suburbs that have similar venues (typically cafes, pubs and Middle Eastern restaurants), residents of similar age and income (older middle class), and similar cultural heritage (English, Italian, Greek).
It utilizes K-Means clustering from Scikit-learn to cluster venue and cultural data, the pandasdmx package to interface with the Australian Bureau of Statistics API, and the Beautiful Soup package to scrape demographic information from the web.
#pandasdmx is used when calling the Australian Bureau of Statistics API
!pip install pandasdmx
import pandas as pd
import numpy as np
import json # library to handle JSON files
from pandas.io.json import json_normalize # transform JSON into a pandas dataframe
import requests
from sklearn.cluster import KMeans
from pandasdmx import Request
from bs4 import BeautifulSoup
Get suburb and postcode data
Lists of Australian postcodes and suburbs are available online from many websites. I've replicated the file from http://www.corra.com.au/australian-postcode-location-data/ on GitHub.
filename = "https://raw.githubusercontent.com/DataistDogma/Coursera_Capstone/master/Australian_Post_Codes_Lat_Lon/Australian_Post_Codes_Lat_Lon.csv"
postcodes = pd.read_csv(filename)
print("Postcodes shape: ", postcodes.shape)
postcodes.head()
Each row represents a Distribution Centre (dc) rather than a suburb or postcode. We're looking for a list of unique suburbs, so we'll filter the dataframe for unique suburbs and remove the 'dc' and 'type' columns, which we don't need.
postcodes.drop_duplicates(subset='suburb', inplace=True)
postcodes.drop(['dc', 'type'],axis=1,inplace=True )
postcodes.dropna(inplace=True) #drop the trailing row which has no data
postcodes.reset_index(inplace=True, drop=True) #reindex after having removed some rows above
postcodes.shape
For simplicity we'll also create a smaller subset of suburbs: those in metropolitan Melbourne, the second-largest city in Australia. The postcode ranges for Melbourne are 3000-3207 and 8000-8499.
melb_postcodes = postcodes[(postcodes['postcode']>=3000)&(postcodes['postcode']<=3207)|(postcodes['postcode']>=8000)&(postcodes['postcode']<=8499)]
melb_postcodes.head()
Assign venue clusters
Use the Foursquare API to get venue recommendations for each suburb and then use K-means to assign a venue cluster to the suburb based on the recommendations returned
#initialise Foursquare API credentials
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
#Define a function to extract the category name from a dataframe row (the column name could be either 'categories' or 'venue.categories')
def get_category_type(row):
try:
categories_list = row['categories']
except:
categories_list = row['venue.categories']
if len(categories_list) == 0:
return None
else:
return categories_list[0]['name']
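As a quick illustration (using a made-up row rather than real Foursquare output), the function simply returns the name of the first category in whichever list it finds:
#hypothetical flattened row; returns 'Café'
get_category_type({'venue.categories': [{'name': 'Café'}, {'name': 'Coffee Shop'}]})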
#define a function to return all the venues for a given neighborhood
#pass the neighbourhood (nb), the latitude (la), the longitude (lo), the radius and the limit
def get_venues(nb, la, lo, radius, limit):
#print(nb)
#form the request url and request only the items (which are the venues)
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, la, lo, VERSION, radius, limit)
items = requests.get(url).json()["response"]['groups'][0]['items']
#handle the case where the Foursquare request returns no results for a particular suburb
if items == []:
print(" -- Foursquare request for ",nb,"returned no results --")
return None
else:
# flatten the JSON, filter for only the wanted columns, then use the get_category_type function to replace the category list with just the category name
venues1 = json_normalize(items)
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
venues1 = venues1.loc[:, filtered_columns]
venues1['venue.categories'] = venues1.apply(get_category_type, axis=1)
#define a new dataframe with the suburb information
venues2 = pd.DataFrame(columns=['Suburb','Suburb Latitude',
'Suburb Longitude'])
#copy each column of the venue dataframe into the (empty) suburb dataframe, then fill the suburb columns with the suburb name, latitude and longitude
for col in venues1:
    venues2[col] = venues1[col]
venues2['Suburb'] = nb
venues2['Suburb Latitude'] = la
venues2['Suburb Longitude'] = lo
#rename the columns
venues2.rename(index=str,columns={"venue.name":"Venue","venue.categories":"Venue Category","venue.location.lat":"Venue Latitude","venue.location.lng":"Venue Longitude"}, inplace=True)
return venues2
#test out the get_venues function
row = 300
n = postcodes.loc[row]['suburb']
n_lat = postcodes.loc[row]['lat']
n_long = postcodes.loc[row]['lon']
df2=get_venues(n, n_lat, n_long, 500, 100)
df2.head()
#Iterate through all the suburbs of the dataframe df, adding each set of venues to the df_suburb_venues dataframe
df=melb_postcodes
df_suburb_venues=pd.DataFrame()
radius = 500
limit = 100
print("Getting venue recommendations, please wait......")
for i, row in enumerate(df['suburb']):
df_suburb_venues=df_suburb_venues.append(get_venues(df.iloc[i]['suburb'], df.iloc[i]['lat'], df.iloc[i]['lon'], radius, limit),ignore_index=True)
print("...... Complete")
counts = df_suburb_venues['Venue Category'].value_counts()
counts.head(10)
#Prepare the data for clustering using one-hot encoding
df_onehot = pd.get_dummies(df_suburb_venues['Venue Category'])
#insert a new column for the Suburb values (note that inserting it directly under the 'Suburb' name causes an error because you can't insert a column that already exists, so the column is renamed after the insert)
df_onehot.insert(0,"New",df_suburb_venues['Suburb'])
df_onehot.rename(columns={'New':'Suburb'}, inplace=True)
#get the mean frequency of occurrence of each venue category for each suburb
df_onehot = df_onehot.groupby("Suburb").mean().reset_index()
df_onehot.head()
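To make the one-hot/groupby step concrete, here is a tiny made-up example showing that taking the mean of the one-hot columns per suburb gives the fraction of returned venues falling into each category:
#toy example with made-up data: suburb A has one Café and one Park, suburb B has one Café
toy = pd.DataFrame({'Suburb': ['A', 'A', 'B'], 'Venue Category': ['Café', 'Park', 'Café']})
toy_onehot = pd.get_dummies(toy['Venue Category'])
toy_onehot.insert(0, 'Suburb', toy['Suburb'])
toy_onehot.groupby('Suburb').mean() #A: Café 0.5, Park 0.5; B: Café 1.0, Park 0.0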
#put the top 10 venue categories for each suburb into a dataframe
df_topvenues=pd.DataFrame(columns=['Suburb','#1','#2','#3','#4','#5','#6','#7','#8','#9','#10'])
for row in range(0,len(df_onehot)):
suburb = df_onehot.iloc[row][0]
topten = df_onehot.iloc[row][1:].sort_values(ascending=False).head(10)
temp=pd.DataFrame([[suburb,topten.index[0],topten.index[1],topten.index[2],topten.index[3],topten.index[4],topten.index[5],topten.index[6],topten.index[7],topten.index[8],topten.index[9]]],columns=['Suburb','#1','#2','#3','#4','#5','#6','#7','#8','#9','#10'])
df_topvenues=df_topvenues.append(temp)
df_topvenues.reset_index(inplace=True, drop=True)
df_topvenues.head()
Assign a cluster based on the recommendations using the k-means clustering algorithm from scikit-learn
# import k-means
from sklearn.cluster import KMeans
df_clustering = df_onehot.drop('Suburb',axis=1)
# set number of clusters
k = 5
# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(df_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:100]
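The choice of k = 5 above is a judgment call rather than the result of any formal procedure; a quick elbow-method sketch like the following (recomputing k-means on df_clustering for a range of k, and assuming matplotlib is available) is one way to sanity-check it:
#optional sanity check: plot the within-cluster sum of squares (inertia) for a range of k
#a pronounced 'elbow' in the curve suggests a reasonable number of clusters
import matplotlib.pyplot as plt
inertias = [KMeans(n_clusters=i, random_state=0).fit(df_clustering).inertia_ for i in range(2, 11)]
plt.plot(range(2, 11), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()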
df_topvenues.insert(0,'Venues Cluster',kmeans.labels_)
df_merged = postcodes
df_merged = df_merged.join(df_topvenues.set_index('Suburb'), on='suburb')
#drop rows with no clusters
df_melbourne_merged = df_merged.dropna(subset=['Venues Cluster'])
df_melbourne_merged.head()
Assign cultural clusters
Use the Australian Bureau of Statistics API to get a breakdown of the ancestry of each suburb, and then use K-means to assign an ancestry cluster to suburbs with similar ancestry.
#create the url which returns the ABS census data for the year 2011 for question T09 - Ancestry by birthplace of parents
allvic_url='ABS_CENSUS2011_T09/TOT+1+2+3+4+Z.TOT+TOTP+1101+1102+6101+3204+2303+2101+5201+2305+2306+3205+3304+7106+2201+3103+6902+4106+3206+3104+1201+1202+3307+3308+2102+3213+7115+9215+3106+4907+5107+2103+OTH+Z.2.SA2..A/all?detail=Full&dimensionAtObservation=AllDimensions'
#create request object
abs = Request('ABS')
#create response object
response = abs.data(allvic_url)
#create dataset object
dataset=response.data
#create top level series
data=response.write()
data.head()
The data above is arranged as a single column with a multi-index (REGION being the region code and ANCP being the ancestry code), so we need to pull the index levels into columns so that we can access the columns we need more easily:
#using reset_index removes all the levels of a multi-index column by default
df_ancestry = data.reset_index(name='Value')
print(df_ancestry.shape)
df_ancestry.head()
#Take only 2011 data
df_ancestry = df_ancestry[(df_ancestry['TIME_PERIOD']=='2011')]
#The data includes six different measures of ancestry (e.g. father only, mother only, both parents).
#in this case the only measure I'm interested in is the total for each ancestry
df_ancestry = df_ancestry[(df_ancestry['MEASURE']=='TOT')&(df_ancestry['ANCP']!='TOT')]
#pivot the Ancestry to columns in preparation for k-means analysis
df_ancestry = df_ancestry.pivot(index='REGION',columns='ANCP', values='Value')
df_ancestry.head()
# set number of clusters
k = 10
# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(df_ancestry)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:100]
df_ancestry.insert(0,'Ancestry Cluster',kmeans.labels_)
df_ancestry.head()
The table above refers to REGION and ANCP (ancestry) by codes. We need to get the XML from the ABS which describes what each code value means and use it to merge the name of the suburb back into the data.
#an xml parser is required to get the codes from the Australian Bureau of Statistics xml file
import xml.etree.ElementTree as ET
#the xml file of codes from the ABS
user_agent_url = 'http://stat.data.abs.gov.au/restsdmx/sdmx.ashx/GetDataStructure/ABS_CENSUS2011_T09'
xml_data = requests.get(user_agent_url).content
#create the xml tree from the file
tree = ET.fromstring(xml_data)
#Create a dataframe which will include all the codes
df_codes = pd.DataFrame(columns=['code','value'])
#parse the structure xml to get the codes:
for child in tree:
for lower in child:
for codelist in lower:
for description in codelist:
if(description.attrib.get('{http://www.w3.org/XML/1998/namespace}lang')=='en'):
code = codelist.attrib.get('value')
value = description.text
temp_df = pd.DataFrame([[code,value]],columns=['code','value'])
df_codes=pd.concat([df_codes,temp_df])
#set the code as the index of the table
df_codes.set_index('code', inplace=True)
#strip whitespaces from the values column
df_codes['value'] = df_codes['value'].str.strip()
df_codes.head(12)
#merge the codes back into the Ancestry dataframe
df_ancestry = df_ancestry.join(df_codes)
We now have a dataframe of suburbs with Ancestry Clusters. We'll clean this dataframe up, as well as the original dataframe containing the Venue Cluster, so that we can merge them on suburb name in the subsequent step.
# select only the required columns, fix the names and make suburb lowercase for searching
df_ancestry = df_ancestry[['value','Ancestry Cluster']]
df_ancestry.rename(columns={"value":"suburb"}, inplace=True)
#change suburb names to lowercase in both dataframes for matching
df_ancestry['suburb'] = df_ancestry['suburb'].str.lower()
df_melbourne_merged['suburb']=df_melbourne_merged['suburb'].str.lower()
df_ancestry.sort_values(by=['suburb'], inplace=True)
df_melbourne_merged.sort_values(by=['suburb'], inplace=True)
#merge on suburb
df_complete = df_melbourne_merged.merge(df_ancestry, how='left', on='suburb')
#and pick only relevant columns
df_complete = df_complete[['postcode','suburb','state','Venues Cluster','Ancestry Cluster']]
#and see how many suburbs didn't find a matching Ancestry Cluster
df_complete[pd.isnull(df_complete['Ancestry Cluster'])]['suburb'].shape
The above shows that 296 of our suburbs did not have an entry in the Ancestry dataframe when we merge directly on suburb name. We'll need to do some looser matching, which gets this number down to 93. Note that the remaining 93 are not really suburbs - they are standalone postcodes used by Australia Post.
#where a suburb couldn't be matched directly, try to find the suburb name as a substring in df_ancestry:
for suburb in df_complete[pd.isnull(df_complete['Ancestry Cluster'])]['suburb']:
if(len(df_ancestry.loc[df_ancestry['suburb'].str.contains(suburb)]['Ancestry Cluster'])>0):
df_complete.loc[df_complete[df_complete['suburb']==suburb].index,'Ancestry Cluster']=df_ancestry.loc[df_ancestry['suburb'].str.contains(suburb)]['Ancestry Cluster'][0]
# where there is still no match, try to split the suburb name and search for the first word as a substring. This allows, for example, 'Altona North' to get the cluster for 'Altona'
for suburb in df_complete[pd.isnull(df_complete['Ancestry Cluster'])]['suburb']:
suburb2 = suburb.split()[0].strip()
if(len(df_ancestry.loc[df_ancestry['suburb'].str.contains(suburb2)]['Ancestry Cluster'])>0):
df_complete.loc[df_complete[df_complete['suburb']==suburb].index,'Ancestry Cluster']=df_ancestry.loc[df_ancestry['suburb'].str.contains(suburb2)]['Ancestry Cluster'][0]
#and see how many suburbs didn't find a matching Ancestry Cluster
df_complete[pd.isnull(df_complete['Ancestry Cluster'])]['suburb'].shape
#all the suburbs without an Ancestry Cluster are not actually recognisable suburbs, so we can drop these:
df_complete.dropna(inplace=True)
df_complete.head()
Assign age and income brackets
Most real estate search tools offer suburb profiles which include information prospective buyers/renters would typically be interested in, such as recent sold prices as well as demographic information. This project will utilize limited demographic information scraped from RealestateVIEW.com.au: Median Age and Weekly Household Income.
#define a function which takes a suburb name and scrapes the web for median age and household income statistics
def get_age_income(state,suburb):
#Format the url
url='https://propertydata.realestateview.com.au/propertydata/suburb-profile/'+state+'/'
suburb = suburb.split()
for word in suburb:
    url = url + word + "+"
age = 0
income = 0
#get the page text for the suburb
page = requests.get(url).text
soup = BeautifulSoup(page, 'lxml')
#find all the important elements
sp = soup.find_all('b')
#get the median age and income for the suburb
for element in sp:
if element.text=='Median Age':
age = element.next_sibling.next_sibling.translate({ord(i): None for i in '\n\t'})
if element.text=='Weekly Household Income':
income = element.next_sibling.next_sibling.translate({ord(i): None for i in '\n\t'})
return age, income
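A quick spot check of the scraper on a single suburb (the exact values depend on the live RealestateVIEW page, so treat this as illustrative only):
#example usage: fetch the median age and weekly household income for one suburb
age, income = get_age_income('victoria', 'brunswick west')
print(age, income)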
#insert columns for median age and weekly income
df_complete.insert(4,'Age',0)
df_complete.insert(4,'Income',0)
#fetch Median Age and Weekly Income from https://propertydata.realestateview.com.au/
print('Getting Median Age and Weekly Income from https://propertydata.realestateview.com.au/, please wait...')
for row in df_complete.itertuples():
result = get_age_income('victoria',row.suburb)
df_complete.loc[df_complete['suburb']==row.suburb,'Age']=result[0]
df_complete.loc[df_complete['suburb']==row.suburb,'Income']=result[1]
print('...complete')
df_complete.head()
We now have a complete list of suburbs with clusters assigned based on the venues in the area and the ancestry of the people, as well as the relative age and income of the inhabitants
Recommend a Suburb
#postcode and state are not required for the recommendation calculation
similar_sort = df_complete.drop(['postcode','state'], axis=1)
#Update the index and clean up the column types
similar_sort.set_index('suburb', inplace=True)
similar_sort['Age']=similar_sort.Age.astype(float)
similar_sort['Income']=similar_sort.Income.astype(float)
#normalize the data into a new dataframe so that the larger-valued Age and Income columns don't dominate the cluster label columns in the distance calculation
from sklearn.preprocessing import MinMaxScaler
x = similar_sort.values #returns a numpy array
min_max_scaler = MinMaxScaler()
feature_mtx = min_max_scaler.fit_transform(x)
feature_mtx[0:5]
df_normal = pd.DataFrame(feature_mtx, index=similar_sort.index, columns=similar_sort.columns)
# The following function takes in a suburb and returns the 10 most similar suburbs, ranked by the sum of absolute differences (Manhattan distance) across the normalized features
def subrec(suburb):
try:
#choose a suburb for comparison
new_suburb = suburb.lower()
new_suburb_df = list(df_normal.loc[new_suburb])
#get the absolute difference between the new suburb and all other suburbs
temp_df = df_normal-new_suburb_df
temp_df = temp_df.abs()
recommendations = temp_df.sum(axis=1).sort_values().head(10)
return recommendations
except:
print("Error: Suburb not found")
Using the function for a given suburb gives the similarity scores for the top 10 similar suburbs
suburb = 'Prahran'
results = subrec(suburb)
results
We can see more details for the top 10 including their feature values:
detailed_results = df_complete[df_complete['suburb'].isin(results.index)]
detailed_results
Map the results
!pip install folium
import folium # map rendering library
latitude=-37.814563
longitude=144.970267
print(latitude,longitude)
detailed_results = df_melbourne_merged[df_melbourne_merged['suburb'].isin(results.index)]
detailed_results
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
colors_list = ['Red','Blue','Green','Yellow','Purple','Pink','Orange']
#add markers to the map
markers_colors = []
for lat, lon, suburb in zip(detailed_results['lat'], detailed_results['lon'], detailed_results['suburb']):
label = folium.Popup(str(suburb), parse_html=True)
folium.CircleMarker(
[lat, lon],
radius=5,
popup=label,
#color=colors_list[int(cluster)],
fill=True,
fill_opacity=0.7).add_to(map_clusters)
map_clusters